Complex Linguistic Features for Text Classification: A Comprehensive Study

Authors

  • Alessandro Moschitti
  • Roberto Basili
Abstract

Previous research on advanced representations for document retrieval has shown that state-of-the-art statistical models are not improved by a variety of linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were found to be ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area compared to document retrieval. In this paper, advanced document representations are investigated. Extensive experimentation with two representative classifiers, Rocchio and SVM, together with a careful analysis of the literature, was carried out to study how some NLP techniques used for indexing affect TC. Cross-validation over four different corpora in two languages yielded overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.


Related resources

Linguistic Features of English Textese and Digitalk of Iranian EFL Students

This study aimed at investigating the English textese of Iranian EFL learners by scrutinizing the linguistic features through a qualitative design. In doing so, 700 messages were collected from 43 MA Iranian EFL learners of both genders. The features were categorized and analyzed calculating the frequency and percentage. The findings of the study showed that Iranian EFL students used different ...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...


Generic Analysis of Literary Translation: A Case Study of Contemporary English Short Stories

Translation of a literary text is a difficult task, for understanding literature requires knowledge of various linguistic levels of a literary text in addition to strategies and methods of translation. To this should still be added cognitive-based translation training which helps practitioners preserve the aesthetic aspects of a literary text. Focusing on short story as a genre with both ...


The Effects of Task Complexity on Input-Driven Uptake of Salient Linguistic Features

The present study investigated the effects of cognitive complexity of pedagogical tasks on the learners’ uptake of salient features in the input. For the purpose of data collection, three versions of a decision-making task (simple, mid, and complex) were employed. Three intact classes (each with 20 language learners) were randomly assigned to three groups. Each group transacted a version of a decis...




Publication date: 2004